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ABSTRACT 

Delimited  text  files  are  often  plagued  by  appended  and/or  truncated  records.  Writing  customized 
SAS !<  code  to  import  such  a  text  file  and  break  out  into  fields  can  be  challenging.  If  only  there 
was  a  way  to  “fix”  the  file  before  importing  it. 

Enter  the  filefixingtool,  a  SAS®  Enterprise  Guide®  project  that  uses  the  SAS  PRX  functions  to 
import,  fix,  and  export  a  delimited  text  file.  This  fixed  file  can  then  be  easily  imported  and 
broken  out  into  fields. 

The  file  fixing  tool  project  was  written  using  SAS  Enterprise  Guide  5.1  and  SAS  9.3  on  a 
Windows  7  Enterprise  operating  system. 


INTRODUCTION 

Delimited  text  files  can  be  considered  “broken”  when  they  contain  appended  and/or  truncated  records. 

It  is  common  knowledge  that  SAS’s  PRX  functions  are  powerful  tools  for  turning  messy  text  into 
structured  data.  But  can  they  be  used  in  a  program  for  restructuring  a  broken  delimited  text  file  into  a  file 
that  can  be  imported  into  SAS?  This  paper  will  show  the  answer  to  this  question  is  yes. 

The  following  topics  will  be  discussed: 

•  Defining  a  delimited  text  file 

•  Delimited  text  file  structure  (both  normal  and  broken) 

•  Text  qualifying 

•  The  truncated-only  method  for  fixing  broken  delimited  text  files  containing  no  appended  records 

•  The  appended  method  for  fixing  broken  delimited  text  files  containing  appended  and/or  truncated 
records 

•  Common  surrounding  characters  (CSCs) 

•  Setting  up  the  file  fixing  tool 

•  Running  the  file  fixing  tool 

•  Testing  the  f i I e_f ix i n g_to o I  using  one  of  the  33  test  case  files 

•  Before  and  after  examples  of  fixed  delimited  text  test  files 

•  Helpful  reminders  and  other  important  information 
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DEFINING  A  DELIMITED  TEXT  FILE 


Although  delimited  text  files  exist  in  a  wide  variety  of  formats,  they  all  contain  the  same  basic  structure  of 
records  separated  by  record  separators  and  the  record’s  fields  separated  by  field  delimiters.  Common 
record  separators  are  a  newline  character,  a  carriage  return  character,  or  a  combined  newline  and 
carriage  return  character.  Common  field  delimiters  are  comma,  tab,  colon,  semicolon,  pipe,  caret,  etc. 

A  very  common  delimited  text  file  format  is  the  CSV  (comma-separated  values)  format,  but  it  has  various 
specifications  and  implementations  (Shafranovich,  2005).  Shafranovich’s  RFC4180  document 
(http://tools.ietf.org/html/rfc4180)  contains  a  definition  for  the  CSV  format  that,  as  he  states,  “seems  to  be 
followed  by  most  implementations.”  His  definition  consists  of  the  following  rules: 

Note:  For  the  below  rules,  CRLF  =  the  record  separator  and  comma  =  the  field  delimiter 

1.  Each  record  is  located  on  a  separate  line,  delimited  by  a  line  break  (CRLF).  For  example: 
aaa,bbb,ccc  CRLF 

zzz,yyy,xxx  CRLF 

2.  The  last  record  in  the  file  may  or  may  not  have  an  ending  line  break.  For  example: 
aaa,bbb,ccc  CRLF 

zzz,yyy,xxx 

3.  There  may  be  an  optional  header  line  appearing  as  the  first  line  of  the  file  with  the  same  format  as 
normal  record  lines.  This  header  will  contain  names  corresponding  to  the  fields  in  the  file  and  should 
contain  the  same  number  of  fields  as  the  records  in  the  rest  of  the  file  (the  presence  or  absence  of 
the  header  line  should  be  indicated  via  the  optional  "header"  parameter  of  this  MIME  type).  For 
example: 

field_narne,field_name,field_name  CRLF 
aaa,bbb,ccc  CRLF 
zzz,yyy,xxx  CRLF 

4.  Within  the  header  and  each  record,  there  may  be  one  or  more  fields,  separated  by  commas.  Each 
line  should  contain  the  same  number  of  fields  throughout  the  file.  Spaces  are  considered  part 

of  a  field  and  should  not  be  ignored.  The  last  field  in  the  record  must  not  be  followed  by  a  comma. 

For  example: 

aaa,bbb,ccc 

5.  Each  field  may  or  may  not  be  enclosed  in  double  quotes  (however,  some  programs,  such  as 
Microsoft  Excel,  do  not  use  double  quotes  at  all).  If  fields  are  not  enclosed  with  double  quotes,  then 
double  quotes  may  not  appear  inside  the  fields.  For  example: 

"aaa","bbb","ccc"  CRLF 
zzz,yyy,xxx 

6.  Fields  containing  line  breaks  (CRLF),  double  quotes,  and  commas  should  be  enclosed  in  double 
quotes.  For  example: 

"aaa'V'b  CRLF 
bb","ccc"  CRLF 
zzz,yyy,xxx 

7.  If  double  quotes  are  used  to  enclose  fields,  then  a  double  quote  appearing  inside  a  field  must  be 
escaped  by  preceding  it  with  another  double  quote.  For  example: 

"aaa","b""bb","ccc" 

For  the  purposes  of  the  file  fixing  tool,  we  need  to  add  two  additional  rules: 

1.  Other  characters  can  be  used  as  the  field  delimiter.  The  following  field  delimiters  have  been  tested 
using  the  file_fixing_tool: 

a.  comma 

b.  caret 

c.  colon 

d.  semicolon 

e.  pipe 

f.  tab 
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Note:  If  your  broken  delimited  text  file  uses  a  field  delimiter  other  than  one  of  the  above  six,  then 
please  see  the  first  bullet  in  the  section  labelled  “Helpful  reminders  and  other  important  information.” 

2.  Either  the  double  quote  or  the  single  quote  can  be  used  for  text  qualifying  (i.e. ,  enclosing)  a  field  and 
they  both  cannot  be  used  for  text  qualifying  within  the  same  file. 

Your  broken  delimited  text  file  must  adhere  to  the  above  9  rules.  Otherwise,  there  is  no  guarantee  that 
the  file_fixing_tool  will  fix  your  broken  file. 


DELIMITED  TEXT  FILE  STRUCTURE  (BOTH  NORMAL  AND  BROKEN) 

As  previously  stated,  a  delimited  text  file  contains  records  that  are  separated  by  record  separators,  and 
these  records  contain  fields  that  are  separated  by  field  delimiters.  An  appended  record  occurs  when  a 
record  is  not  immediately  followed  by  a  record  separator.  A  truncated  record  occurs  when  a  record 
contains  a  record  separator  within  the  record.  Sometimes  a  delimited  text  file  contains  both  appended 
and  truncated  records.  Let’s  use  the  following  examples  to  illustrate. 

If  we  were  to  visualize  an  unbroken  delimited  text  file  containing  no  appended  records  and  no  truncated 

records,  it  would  resemble  the  following  where  [ - ]  is  one  complete  record  and  <rs>  is  the  record 

separator: 

[— - ]<rs> 

[ - ]<rs> 

[ - ]<rs> 

[— - ]<rs> 

Notice  that  the  file’s  record  separator  is  correctly  located  at  the  end  of  each  record,  which  causes  each 
line  to  contain  exactly  one  record. 

A  broken  delimited  text  file  containing  only  appended  records  would  resemble  this: 

[ - .][. - ]<rs> 

[ - ][ - ][ - ]<rs> 

[ - ]<rs> 

[ - .][. - ]<rs> 

Notice  that  the  lack  of  record  separators  after  records  1,  3,  4,  and  7  causes  the  appended  records. 

A  broken  delimited  text  file  containing  only  truncated  records  would  resemble  this: 

[ — <rs> 

-]<rs> 

[ - ]<rs> 

[— <rs> 

— ]<rs> 

[ - ]<rs> 

Notice  that  record  separators  occurring  within  records  1  and  3  cause  the  truncated  records. 

And  lastly,  a  broken  delimited  text  file  containing  both  appended  and  truncated  records  would  resemble 
this: 

[ - .][ — <rs> 

-]<rs> 
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[— -]<rs> 

[— <rs> 

--][■ - ][ - ]<rs> 

[— <rs> 

-]<rs> 

Notice  that  the  lack  of  record  separators  after  records  1, 4,  and  5  causes  the  appended  records  while 
record  separators  occurring  within  records  2,  4,  and  7  cause  the  truncated  records. 


TEXT  QUALIFYING 

Only  the  double  quote  or  single  quote  can  be  used  as  a  text  qualifier  to  surround  a  cell  (i.e.,  an  individual 

field  contained  within  one  record),  and  both  cannot  be  used  as  text  qualifiers  within  the  same  delimited 

text  file. 

Text  qualifying  MUST  take  place  if  either  of  the  following  two  conditions  occurs: 

1 .  A  delimited  text  file  contains  a  cell  in  which  there  exists  a  field  delimiter.  A  cell  with  this  condition 
must  be  text  qualified  (i.e.,  contain  surrounding  text  qualifiers). 

2.  A  delimited  text  file  contains  a  text-qualified  cell  in  which  there  exists  the  text-qualifier  character.  A 
text-qualifier  character  occurring  in  such  a  cell  must  be  escaped  with  another  text-qualifier  character. 
In  other  words,  it  takes  two  text-qualifier  characters  to  mean  one  (i.e.,  two  double  quotes  to  mean  one 
double  quote  and  two  single  quotes  to  mean  one  single  quote).  A  single  unescaped  text-qualifier 
character  occurring  within  a  text-qualified  cell  will  throw  off  the  column  alignment  for  that  record  when 
the  delimited  text  file  is  imported  into  SAS. 

Note:  A  cell  CAN  be  text  qualified  without  meeting  one  of  the  above  two  conditions. 


TRUNCATED-ONLY  METHOD 

This  method  can  only  be  used  on  broken  delimited  text  files  containing  truncated  records  and  no 
appended  records.  But  what  is  the  definition  of  an  appended  record? 

Within  a  broken  delimited  text  file,  an  appended  record  occurs  when  the  beginning  of  any  record  occurs 
on  a  line  containing  any  characters  from  the  previous  record.  In  other  words,  a  record  separator  does  not 
exist  between  two  records  (as  it  should). 

Besides  the  “no  appended  records”  requirement,  there  is  one  other  requirement  for  using  this  method, 
and  that  requirement  is  the  following: 

Within  the  broken  text  file,  the  first  field  of  every  record  and  its  immediately  following  field  delimiter  (i.e., 
the  field  delimiter  occurring  between  the  first  and  second  fields)  must  occur  on  the  same  line. 

In  other  words,  a  record  separator  cannot  exist  between  a  record’s  first  field  and  its  immediately  following 
field  delimiter.  Without  this  requirement,  the  tool  will  not  be  able  to  determine  when  it  has  encountered  a 
new  record  and  will  treat  the  new  record’s  first  field  fragment  as  part  of  the  previous  record’s  last  field. 

The  chances  of  the  first  field  being  so  long  that  it  would  be  truncated  are  slim,  since  the  vast  majority  of 
delimited  text  files  locate  their  identifier  or  key  variables  among  the  beginning  fields  while  their  variables 
containing  long  text  strings  are  located  among  the  ending  fields. 

Unlike  the  appended  method,  the  truncated-only  method  does  not  use  your  last_field  and  first_field 
patterns  (i.e.,  the  patterns  contained  in  your  two  macro  variables)  to  fix  your  delimited  text  file.  It  uses  a 
built-in  regular  expression  to  identify  and  count  the  number  of  fields  encountered  up  until  the  last  field,  at 
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which  point  another  regular  expression  is  used  to  deal  with  separating  the  record’s  last  field  from  the 
following  record’s  first  field. 


APPENDED  METHOD 

Fixing  a  broken  delimited  text  file  containing  only  truncated  records  is  relatively  easy.  Fixing  one 
containing  appended  records  (with  or  without  truncated  records)  is  more  difficult  and  sometimes 
impossible. 

The  key  to  fixing  a  delimited  text  file  containing  appended  records  is  developing  a  pattern  (i.e. ,  a  perl 
regular  expression  segment)  that  isolates  either  the  file’s  first  field  or  a  pattern  that  isolates  its  last  field 
from  the  rest  of  the  file.  (The  file’s  other  fields  are  already  isolated  by  their  surrounding  field  delimiters.) 
In  other  words,  with  this  pattern,  we  are  able  to  identify  throughout  the  file  the  location  at  which  one 
record  ends  and  the  next  record  begins.  Identifying  this  location  is  critical  because  this  is  the  only 
location  throughout  the  text  file  that  should  contain  a  record  separator  instead  of  a  field  delimiter,  which 
would  be  expected  since,  as  we  have  mentioned,  this  location  is  the  end  of  one  record  and  the  beginning 
of  another.  In  fact,  any  delimited  text  file  with  appended  records  can  be  fixed  if  this  location  can  be 
identified  no  matter  how  appended  or  truncated  the  records  in  your  file  may  be. 


CREATING  YOUR  LAST_FIELD  AND  FIRST_FIELD  PATTERNS  FOR  THE  APPENDED  METHOD 

To  identify  and  isolate  your  broken  file’s  last  field  from  its  first  field,  you  need  to  enter  patterns  in  the  form 
of  perl  regular  expression  segments  for  the  macro  variables  last_field  and  first_field.  These  patterns  will 
depend  on  your  knowledge  of  the  field’s  contents  and  how  close  your  pattern  comes  to  matching  those 
contents. 

Here  are  a  few  examples  of  field  contents  and  their  matching  patterns. 

NOTE:  The  term  “visibly  blank”  means  having  0  or  more  spaces. 

1.  Social  Security  number 

a.  Contents:  a  9-digit  character  string 

b.  Pattern:  \d{9} 

2.  Social  Security  number  with  some  cells  being  visibly  blank 

a.  Contents:  a  9-digit  character  string  or  visibly  blank 

b.  Pattern:  (\d{9}|  *)  where  pipe  is  the  “or”  metacharacter  and  <space>*  means  0  or  more  spaces 

c.  Note:  The  file_fixing_tool  requires  the  use  of  “non-capturing  group”  parentheses  whenever 
parentheses  are  used  within  your  last  field  and  first  field  patterns,  so  the  pattern  would  have  to 
be  entered  as: 

(?:\d{9}|  *) 

3.  Categorical  data 

a.  Contents:  A  string  containing  either  red,  white,  blue,  or  visibly  blank 

b.  Pattern:  (?:red|white|blue|  *) 

Note#2:  Non-capturing  group  parentheses  MUST  always  surround  an  “or”  operation  or  a 
continuous  series  of  “or”  operations  like  the  above  pattern. 

4.  Categorical  data  (case-insensitive) 

a.  Contents:  A  string  containing  red,  white,  blue,  RED,  WhiTE,  BluE,  etc...,  or  visibly  blank 

b.  Pattern:  (?i:red|white|blue|  *) 
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Note:  The  ‘i’  in  ‘(?i:’  makes  the  contents  case-insensitive  (Dunn,  2011). 

5.  A  number  between  1  and  1000000  (one  million) 

a.  Contents:  A  character  string  containing  between  1  and  7  digits  or  visibly  blank 

b.  Pattern:  (?:\d{1 ,7}|  *) 

6.  Date  (using  a  very  forgiving  pattern) 

a.  Contents:  A  date  string  with  format  MM/DD/YYYY  or  visibly  blank 

b.  Pattern:  (?:\d{2}\Ad{2}\Ad{4}|  *) 

Note:  Use  more  “forgiving”  patterns  when  your  data  are  messy  and  error  laden.  For  example, 
the  above  pattern  will  match  the  string  55/77/9999,  but  that  is  ok,  since  our  goal  is  to  fix  the 
broken  file  and  not  identify  bogus  date  data.  If  our  pattern  is  too  precise,  like  the  next  example, 
then  the  file  fixing  tool  will  fail. 

Note#2:  Although  the  forward  slash  is  not  a  metacharacter,  it  must  be  escaped  (i.e.,  back- 
slashed),  since  the  file  fixing  tool  uses  it  as  a  regular  expression  delimiter  in  its  processing. 

7.  Date  (incorrect  use  of  a  precise  and  not  very  forgiving  pattern) 

a.  Contents:  A  date  string  with  format  MM/DD/YYYY  or  no  contents 

b.  Pattern:  (?:(?:0?[1 -9]|  1  [01 2])V(?:0?[1 -9]|[1 2][0-9]|3[01 2])V(?:1 9|20)\d\d|  *) 

Note:  WRONG!  Precision  like  this  will  cause  the  file  fixing  tool  to  fail,  since  the  pattern  will  match 
a  real-life  date  and  will  not  match  a  string  like  55/77/9999.  The  f i I e_f ix i n g_to o I  needs  to  be  able  to 
match  any  existing  junk  data  if  it  exists  with  the  correct  format.  Save  this  kind  of  precision  for 
downstream  programming  involving  identifying  bogus  dates.  It  has  no  place  in  the  fixing  of 
broken  delimited  text  files. 


COMMON  SURROUNDING  CHARACTERS  (CSCs) 

Common  surrounding  characters  (CSCs)  are  characters  that  can  exist  on  the  edges  of  your  field  (i.e.,  cell) 
yet  still  are  contained  within  the  field’s  contents. 

For  text-qualified  fields: 

1 .  They  can  exist  just  inside  the  text  qualifiers  surrounding  your  field. 

2.  They  can  be  any  of  the  following  characters: 

a.  The  field  delimiter  character 

b.  A  pair  of  text-qualifier  characters  (meaning  a  successive  EVEN  number  of  text-qualifier 
characters)  that  does  not  include  the  field’s  two  actual  surrounding  text  qualifiers 

c.  The  space  character 

d.  The  text  qualifier  character  NOT  being  used  as  the  text  qualifier.  For  example,  if  the  double 
quote  was  being  used  as  the  text  qualifier,  then  the  one  not  being  used  would  be  the  single  quote 
and  vice  versa. 

For  non  text-qualified  fields: 

1 .  They  can  exist  just  inside  the  field  delimiters. 

2.  They  can  be  any  of  the  following  characters: 

a.  The  double  quote  character 

b.  The  single  quote  character 

c.  The  space  character 

CSCs  can  occur  in  any  of  your  broken  file’s  fields  and  they  do  not  need  to  be  accounted  for  within  your 
last  field  or  first  field  patterns  when  running  the  appended  method. 
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So  what  are  the  benefits  of  automatically  accounting  for  CSCs  in  your  last  field’s  or  first  field’s  contents? 

There  are  several  benefits: 

1 .  These  characters  tend  to  occur  quite  often  on  the  edge  of  fields  whether  the  field  is  considered  to  be 
text  qualified  or  not. 

2.  If  the  file  fixing  tool  did  not  automatically  account  for  them  occurring  in  your  last  field  or  first  field 
contents,  then  they  would  need  to  be  accounted  for  in  your  last  field  and  first  field  patterns.  By 
accounting  for  them,  you  are  always  covered. 

3.  Accounting  for  them  does  not  affect  the  match  since  the  pattern  will  account  for  them  only  when  they 
are  present  in  your  field. 

Note:  It  is  extremely  important  to  remember  that  only  CSCs  occurring  on  the  edges  of  the  field 
contents  (i.e.,  outside  any  non-CSC  characters)  are  automatically  accounted  for  by  the 
file  fixing  tool.  If  they  occur  inside  of  non-CSC  characters,  then  they  need  to  be  accounted  for  within 
the  field  patterns  contained  in  your  last_field  or  first_field  macro  variables. 


SETTING  UP  THE  FILE_FIXING_TOOL 

The  following  are  steps  for  setting  up  the  file  fixi ng  tool : 

1.  Create  a  directory  named  “f i I e_f i x i n g  too I  d i r”  on  your  desktop. 

2.  Within  this  directory,  create  the  following  two  subdirectories: 

a.  infile  dir  -  will  contain  broken  delimited  text  files  of  the  following  types: 

1)  Test  case  files  that  come  with  the  file_fixing_tool  (33  test  case  files) 

2)  Your  own  broken  delimited  text  files  that  need  fixing 

b.  results_dir  -  fixed  delimited  text  files  will  be  sent  here  and  be  named  one  of  the  following: 

1)  APPEND_METHOD_RESULTS.txt 

2)  TRUNCATED_ONLY_METHOD_RESULTS.txt 

3.  Copy  the  file_fixing_tool  SAS  Enterprise  Guide  project  to  the  file  fixing  tool  dir  directory. 

4.  Copy  all  33  test  case  files  to  the  infile_dir  directory. 

5.  Open  the  file_fixing_tool  SAS  EG  project. 

6.  Within  the  following  three  process  flows,  change  the  path  of  the  imported  fixed  file  to  the  FULL  PATH 
of  your  results_dir  directory: 

a.  Appended  Method  Testing,  NoText  Qualifying 

b.  Appended  Method  Testing,  Text  Qualifying 

c.  Truncated-Only  Method  Testing 

7.  Open  the  process  flow  named  “Set  macro  variables  and  compile  macros”  and  open  the  program 
named  “everything_macro”  and  enter  your  path  for  the  macro  variable  named  dir.  Make  sure  to 
include  the  backslash  at  the  end  of  your  path.  For  example: 

C:\Users\genovePC\Desktop\file  fixing  tool\ 

8.  Setup  is  now  complete. 

RUNNING  THE  FILE_FIXING_TOOL 

The  header  record  (if  one  exists  at  the  top  of  your  broken  text  file  file)  must  be  removed  and  stored  before 
running  the  file  fixing  tool,  since  it  will  interfere  with  the  regular  expression  matching.  After  you  have  run 
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the  file_fixing_tool  and  fixed  your  broken  file,  simply  prepend  the  header  record  back  to  the  top  of  the 
fixed  file.  I,  the  author  of  this  paper,  apologize  for  not  adding  this  functionality.  I  simply  ran  out  of  time. 
Coding  this  functionality  is  a  lot  trickier  than  it  would  seem  and  the  costs  might  outweigh  the  benefits. 
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To  run  the  file_fixing_tool,  do  the  following: 

1.  Open  SAS  Enterprise  Guide  and  open  the  file  fixing  tool. 

2.  Open  the  process  flow  named  ’’Set  macro  variables  and  compile  macros”  and  open  the  program 
named  “everything_macro.” 

3.  You  are  only  allowed  to  modify  the  following  macro  variables  contained  in  the  area  within  the 
“MODIFIABLE  CODE”  boundaries: 


#  MODIFIABLE  CODE:  BEGIN  # 


%let  contains_text_qualifying  =  %str(y); 

%let  fld_dlm  =  %nrstr(\x2C);  /*  Identify  the  field  delimeter.  Use  \x2C  for  comma  7 
%let  num_fields  =  %str(5);  /*  Number  of  fields  in  the  file.  7 

%let  txq  =  %str(%");  /*  %"  for  double  quote,  %'  for  single  quote,  or  n  for  no  text  qualifying  7 

%let  run_appended_method  =  %str(n);  /*  y  =  run,  n  =  don't  run  7 

%let  run_truncated_only_method  =  %str(y);  /*  y  =  run,  n  =  don't  run  7 

%let  input_line_stop  =  %nrstr(=~##);  /*  Used  for  marking  the  ends  of  inputted  text  file  lines  7 
%let  in  filejrecl  =  %str(500);  /*  Length  of  character  variables  used  in  processing  7 

%let  out  file  lrecl  =  %str(1000);  /*  Length  of  output  file  records  7 

%let  pre_trunc_num_spaces  =  %str(20);  /*  Used  for  truncation  detection  7 

%let  dir  =  %nrstr(C:\Users\genovePC\Desktop\file_fixing_tool\); 

%let  infile  dir  =  %str(&dir)%str(infile_dir\); 

%let  results_dir  =  %str(&dir)%str(results_dir\); 

%let  in  file  =  %nrstr(TCases_SomeTxq_Empty_Spaces_EmbDlm_EmbTxq_trunc.txt);  /*  7 

%let  fi rst  field  =  %nrstr((?:(?i:[A-Z]+?)|  *?));  /*  used  only  for  the  appended  method  7 

%let  last  field  =  %nrstr((?i:red|white|blue|  *));  /*  used  only  for  the  appended  method  7 

#  MODIFIABLE  CODE:  END  # 


Notice  that  the  last  two  macro  variables  (first  field  and  last  field)  are  used  by  the  appended  method  only 
and  are  ignored  by  the  truncated-only  method. 

4.  Set  each  macro  variable  according  to  the  following  definitions: 

a.  contains_text_qualifying  -  set  toy  or  n. 

1)  If  set  to  y,  then  the  file  fixing  tool  will  process  the  file  as  a  text-qualified  file.  You  must 
also  set  the  macro  variable  txq  to  the  text  qualifier  being  used. 

2)  If  set  to  n,  then  the  f i I e_f ix i n g_to o I  will  process  the  file  as  a  delimited,  non-text-qualified 
file,  and  the  txq  macro  variable  will  not  be  used  during  processing. 

b.  flddlm  -  field  delimiter  used  in  your  broken  file.  For  example:  ,  <tab>  :  ;  |  A  etc... 

Note:  When  your  field  delimiter  is  a  comma,  use  its  hex  value  (i.e.,  \x2C).  Otherwise,  it  will  be 
misinterpreted  within  the  %nrstr()  macro  function  as  a  separator  of  arguments  to  the  macro 
function. 

c.  num_fields  -  number  of  fields  contained  in  your  broken  file. 

d.  txq  -  the  text  qualifier  used  (if  file  contains  text  qualifying).  Use  one  of  the  following  : 
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1)  %”  -  double  quotes  used  as  text  qualifier 

2)  %’  -  single  quotes  used  as  text  qualifier 

3)  n  -  no  text  qualifying 

e.  run_appended_method  -  you  must  enter  either  y  or  n. 

f.  run_truncated_only_method  -  you  must  enter  either  y  or  n. 

Note:  You  cannot  run  the  truncated-only  method  on  a  broken,  delimited  text  file  containing 
appended  records.  Doing  so  will  create  bogus  results. 

g.  inputJine_stop  -  keep  the  already-entered  string  of  =-##  This  string  is  used  during  the 
importing  of  your  broken  file  to  mark  the  end  of  each  line  (i.e.,  where  a  record  separator  (aka  line 
stop)  occurred).  If  your  file  happens  to  contain  the  string  =-##,  then  choose  another  string,  but 
be  careful  since  you  need  to  choose  characters  that  are  not  regular  expression  metacharacters  or 
characters  that  are  not  special  in  the  SAS  Macro  programming  language  such  as  %  &. 

h.  in_file_lrecl  -  a  number  used  to  make  sure  that  truncation  does  not  occur  during  the  importing  of 
your  broken  file.  If  you  receive  a  truncation  error,  then  increase  this  value. 

i.  out_file_lrecl  -  a  number  used  to  make  sure  that  the  fixed  record  outputted  to  your  fixed  file  is 
not  truncated.  It  is  also  used  as  length  values  for  certain  variables  used  in  creating  your  fixed 
record. 

j.  pre_trunc_num_spaces  -  a  number  used  for  detecting  truncation. 

k.  dir  -  the  directory  path  containing  the  directories  infiledir  and  results_dir. 

l.  infile  dir  -  the  directory  containing  your  broken,  delimited  text  file. 

m.  results  dir  -  the  directory  where  your  fixed,  delimited  text  file  will  be  sent. 

n.  in  file  -  the  name  of  your  broken,  delimited  text  file.  Don’t  forget  the  “.txt”  suffix  in  your  file 
name. 

o.  first_field  -  a  pattern  (in  the  form  of  a  perl  regular  expression  segment)  that  matches  the 
contents  of  the  first  field  but  not  the  contents  of  the  previous  record’s  last  field. 

p.  last  field  -  a  pattern  (in  the  form  of  a  perl  regular  expression  segment)  that  matches  the  contents 
of  the  last  field  but  not  the  contents  of  the  next  record’s  first  field. 

5.  You  are  now  ready  to  run  the  file  fixing  tool  by  clicking  the  following: 

File  ->  Run  file_fixing_tool 


TESTING  THE  FILE_FIXING_TOOL  USING  ONE  OF  THE  33  TEST  CASE  FILES 

There  are  33  test  cases  in  the  form  of  delimited  text  files  that  are  used  for  testing  the  file  fixing  tool. 
Each  test  case  file: 

•  contains  5  fields 

•  contains  21  records 

•  contains  a  list  within  its  contents  (see  below  picture)  of  macro  variable  settings  needed  for 
running  the  file  fixing  tool  on  the  test  case  file 
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TCaie_003_comma_FF_uc_LFJc_Emptyflds_$p«c«  -  Notepad 


File  Edit  Format  View 

DECEMBER, ---aaa , 
January, — aaa, 

- aaa, 

march  , - aaa, 

, - aaa , 

may, - aaa, 

- aaa, 

- aaa, 

, . aaa, 

September, — aaa, 
(October  .--aaa, 

^  November, — aaa, 
DECEMBER , ---aaa , 

, - aaa, 

FEBRUARY, ---aaa, 

march, - aaa, 

APRIL  , — aaa, — 

may  , - aaa ,  — 

JUNE, - aaa. - 

, - aaa, - 

august, - aaa, - 


Help 

First  Field  contents:  all  uc  empties  spaces 
Last  Field  contents:  all  lc  empties  spaces 
XI et  contains_text_qua1ifying  -  Xstr(n); 

Xlet  fld_dlm  =  %nrstr(\x2C); 

Xlet  minefields  -  Xstr(S); 

Xlet  txq  =  Xstr(n); 

Xlet  run_appended_method  -  Xstrl 

Xlet  run_truncated_only_jnethod  =  Xstrl 

Xlet  first_field 
Xlet  last 

Xlet  in_file  -  Xnrstr( 

TCase_003_connna_FF_uc_LF_lc_EmptyFlds_spaces.  txt); 


(y); 
Cy); 

,t_f1eld  =  Xnrstr ((?:  [A-zl+ 1  •)); 
:_field  -  Xnrstr(Q?:  [a-zj+ I  *);; 


Ln  1,  Col  1 


•  exists  in  an  “already-fixed”  state  meaning  it  contains  no  appended  or  truncated  records 

Note:  The  4  test  case  files  whose  names  begin  with  ‘TC_T’  do  not  exist  in  an  “already-fixed” 
state.  We  will  explain  why  shortly. 

Most  of  our  test  case  files  exist  in  an  “already-fixed”  state  because,  as  previously  stated,  the  key  to  fixing 
a  broken  delimited  text  file  containing  appended  records  (with  or  without  truncated  records)  is  dependent 
upon  developing  last  field  and  first  field  patterns  that  identify  and  isolate  the  last  and  first  fields.  The 
amount  of  appending  and/or  truncating  occurring  in  your  broken  delimited  text  file  is  not  a  factor  when 
using  the  f i I e  f ix i n g_to o I  to  fix  it. 

If  your  last  field  and  first  field  patterns  are  successful  in  identifying  and  isolating  either  of  these  fields, 
then  the  “already-fixed”  file  will  remain  in  this  state. 

If  not,  then  the  file  fixing  tool  will  destroy  this  state. 


TESTING  THE  FILE_FIXING_TOOL  USING  ‘TC_T  FILES 

The  ‘TC_T’  files  are  “truncated-only”  or  “truncated  and  appended”  versions  of  the  files  whose  test  case 
number  they  contain  in  their  name.  For  example,  the  following  two  files  are  identical  except  that  one  of 
them  is  in  an  “already-fixed”  state  and  the  other  contains  truncated  records: 

TCase_008_comma_FF_ucJc_LF_ucJc_EmptyFlds_Spaces.txt 

TC_T _ 008_TRUNCATED_ONLY_comma_FF_ucJc_LF_ucJc_EmptyFlds_Spaces.txt 

The  ‘TC_T’  files  are  used  for  ensuring  that  the  file_fixing_tool  fixes  broken  delimited  text  files  that  actually 
contain  appended  records  and/or  truncated  records. 

When  running  the  file  fixing  tool  on  one  of  the  ‘TC_T’  files,  set  the  macro  variables  to  the  same  values 
as  the  number  of  the  test  file  contained  in  the  ‘TC_T’  test  file’s  name.  For  example,  the  following  two  files 
should  have  the  same  macro  variable  settings  since,  as  we  have  already  mentioned,  they  are  identical 
files  but  one  has  truncated  records: 

TCase_008_comma_FF_uc_lc_LF_uc_lc_EmptyFlds_Spaces.txt 

TC_T_008_TRUNCATED_ONLY_comma_FF_ucJc_LF_ucJc_EmptyFlds_Spaces.txt 

You  will  get  the  exact  same  results  when  running  the  f i I e_f ix i n g  too I  on  either  of  these  files  since  they  are 
the  same  exact  file  when  they  are  both  in  their  fixed  states. 
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RUNNING  THE  FILE_FIXING_TOOL  ON  ONE  OF  THE  33  TEST  CASE  FILES 

1 .  Open  the  test  case  file  and  notice  how  the  third  field  contains  information  dealing  with  macro  variable 
settings. 

2.  Open  the  everything_macro  program  and  set  your  macro  variables  to  the  settings  contained  in  your 
test  case  file. 

3.  Within  the  everything_macro  program  and  directly  below  the  “MODIFIABLE  CODE”  area,  you  will  see 
another  area  between  labels  “TESTING  AREA:  BEGIN”  and  “TESTING  AREA:  END.” 

4.  Within  this  area,  uncomment  one  %let  statement  for  the  fi rst  field  macro  variable  and  one  for  the 
last  field  macro  variable  for  the  test  case  that  you  are  running.  Each  %let  statement  has  an 
identifying  comment  to  its  right. 

5.  You  are  now  ready  to  run  the  file  fixing  tool  on  your  test  case  file. 


IMPORTING  THE  FIXED  TEST  FILE 

Whenever  the  file _fixing_tool  is  run  using  the  appended  method  on  one  of  the  33  test  case  files,  the 
resulting  fixed  file  (contained  in  APPEND_METHOD_RESULTS.txt)  is  automatically  imported  into  SAS 
and  broken  out  into  its  5  fields. 

Each  of  the  33  test  case  files  has  its  own  file  import  process  within  one  of  the  following  two  process  flows: 

1.  Appended  Method  Testing,  No  Text  Qualifying 

2.  Appended  Method  Testing,  Text  Qualifying 

The  following  is  a  picture  of  the  Project  Tree  showing  the  two  process  flows  and  their  contents. 
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x  Appended  Method  Testing,  NoText  Qualifying 


Project  Tree 

3 Set  macro  variables  and  compile  macros  > 
B  "Cj  Programs 

1 . [^1  everythingjnacro 

3  Add  markers,  import  text  file,  and  test  fortrunr 
l-i  Fix  dataset  -  Appended  Method 
3  Fix  dataset  -  Truncated -only  Method 
3 -8^g  Appended  Method  Testing.  NoText  Qualifyini 
E)  a  APPEND_METHOD_RESULTS.txt 
h#  tcase_00 1  _comma 
j— tcase_002_comma 
!•••■ tcase_003_comma 
tcase_0Q4_comma 
tcase_005_comma 
tcase_006_caret 
tcase_007_colon 
j~-^P  tcase_008_comma 

I  tc_t _ 008_comma_trunc_test 

h#  tc_ta_008_comma_trunc_append_te 
|™#  tcase_009_pipe 
!••••#  tcase_010_semicolon 
tcase_011_tab 

-] -8^g  Appended  Method  Testing,  Text  Qualifying 
B-^j  APPEND_METHOD_RESULTS.txt 
j--#  tcase_012_comma 
h  tcase_013_comma 
h#  tcase_014_comma 
!••••#  tcase_015_comma 
h#  tcase_016_comma 
h#  tcase_017_comma 
h#  tcase_018_caret 
r  tcase_019_caret 
j  tcase_020_colon 
|~ tcase_021_colon 
!  •#  tcase_022_comma 

!  ^3  tc_t _ 022_comma_trunc_test 

tc_ta_022_comma_trunc_append_te 
j-i^p  tcase_023_comma 
h^l  tcase_024_pipe 
!—^]  tcase_025_pipe 
t^I  tcase_026_semicolon 

j  rrr  \ 


t>  Run  ▼  J  Stop  Export  ▼  Schedule  ▼  Zoom 


The  use  of  SAS  Enterprise  Guide’s  conditional  processing  ensures  that  only  the  import  process  for  the 
test  case  file  being  run  will  execute. 

No  test-related  import  process  in  either  of  the  above  two  process  flows  is  executed  when  the 
file_fixing_tool  is  run  on  a  delimited  text  file  other  than  one  of  the  33  test  files.  In  other  words,  every 
import  process  within  these  two  process  flows  is  for  testing  only. 

The  process  flow  named  “Truncated-Only  Method  Testing”  contains  only  the  following  two  import 
processes: 
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Project  Tree 

f  Sm  Set  macro  variables  and  compile  macros 
+i  Add  mariners,  nrport  text  fie.  and  test  for  truncation 

5-1  Svjj  Rx  dataset  -  Appended  Method 
v  Rx  dataset  •  Ton  cat  ed -only  Method 
±1  Appended  Method  Testing.  NoTexl  Qualifying 

i  fiog  Appended  Method  Testing.  Text  Qualifying 
-I  Truncated-Only  Method  Testing 

E  ^  TRUNCATED_ONLY_METHOD_RESULTS  txt 
tc_t_008 Jnjncated_only 
^  tc_t _ Q22_truncated_only 


BEFORE  AND  AFTER  EXAMPLES  OF  FIXED  DELIMITED  TEXT  TEST  FILES 
EXAMPLE  #1 

The  following  broken  delimited  text  file  contains  truncated  records  but  no  appended  records: 
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The  following  is  the  above  broken  delimited  text  file  having  been  fixed  using  the  appended  method: 


File  £de  Forme  Vim  FMp 

JAMJUAY 

.RAUCH.* 

JULY**  *' 


J  lift  MOST'  .  ■ 

l‘OCT0€€«  * 

N  'NDSGVBCR** 


fl  ‘  .  .FEBftUAAV  ' 


)UBS  ‘  V 

‘AUGUST 


VC 

L  ii  t  cl* Id  CCr'ilSr'ltS  .  wC  1C  SCrti_l xQ_T iildl 

Met  CCnta1n3_text -qualifying  .  \jtr(y); 

\1«t  f1d_d1e  -  wstr(  *2c); 

\let  ri**_f1e1ds  -  \str(5); 

Met  t*q  -  Vstr  (vd4>); 

Met  rur\_appended_nethod  -  \3tr (y); 

Met  r«rfi_trw»Xitiid_orily_r»i(thod  -  Mtr(yj; 


enpt ies 


spaces 

spaces 


enfcedded.txQS  irtPfaddi-d_dlrK 


First  attempt . 
Second  Attempt  - 


Mat  flrst_f1e1d  -  Vrstr  CfM  :Fa-z 
Met  flrst.fleld  -  JUnrstrCpi 


3;« 


anvary  February  I, 


Met  1ast_f1eld  -  Vrstr((Ti  :r«d  »*?1te  blue  •)); 

note  *  vdq>  .  doable  <?.-ote  character 
Mat  1n_f11e  -  Vrstr( 

TC_T _ 022_trcmCaT£0_o*.  v_cc*»*a_3c»»eOQs-f  P_vc_lcjLP_vc_lc_£NPty^lds_Spaces  jENbTM>s_£rfcCt.*s ,  t«t ) 


.ccc. 

’.ccc. 

".ccc. 

.ccc,'. 

“.ccc. 

.ccc. 

".ccc." 

.ccc.*.* 

*,ccc. 

’.ccc. 

".ccc. 

.ccc. 

■ccc."  blue 

.ccc.*.* ’.red. . 

.ccc.*  .white 
*.ccc.*aus’* 


BLUB.  ” 
MHXTB. 
*BCO 
BLUE’ 

red.*" 


-fCCC.”.*lrfVlte* 
--.CCC.*  BED 

-•.ccc.dilte 


The  following  is  the  SAS  dataset  created  by  importing  the  above  fixed  delimited  text  file  into  SAS: 


&  M  F2 

&  F3  £  F4 

&  F9 

1 

CECEM&r  pm 

Fwk  Field  concerts  uc  *c  some_txq_Se*ds  e—p»es  spaces  errtwodec.oos  en-teodedjJTr-s  ccc 

“blue 

? 

— asa 

Last  Field  contents  «c  k  »ome_tiiq_f»e4ds  empties  spaces  efbedded.tKds  erbedqed_dfns  ccc 

3 

BBI 

‘•Jet  cortans je3d_ou»i*y*c  -  ‘.stnyi  ccc 

4 

s 

E. 

.WHICH.*.  aaa 

%tattdjftn  •  HiinKkXfc _ ccc 

HKMjiO  -Hoflc _ tee _ 

XMMB  _  > \flaTvCdQ>)  _  B 

Bt'JE- 

mm/,  — i 

fM  *  = 

7 

8 

JlAT  ]  BBI 

’.lei «i^_*r»a»ca*ed_or#y_meihod  •  >  ccc 

-r 

9 

- Ml 

ccc 

10 

e«c*errfeer - eee 

Fw»!  eBwwpT  *Je1  HreJ_fi«<d  •  |A-Z  P>.  ccc 

W.UE- 

11 

OCTOBER  - eee 

Second  *Omy*  *JMBrel_Wd»*imir<|T»|erNaryt1ifcruerjl  no^emberldecwr^erl  *>).  ccc 

'red.** 

12 

NO’-EWBER'  . eee 

ccc 

13 

DECEMBER  - eea 

b«.W3  ■  *AiwBeii7i*edh*«le<bue»  *JJ:  ccc 

Wue.... 

14 

- sea 

ccc 

/red” 

IS 

..FEBRUARY  — eaa 

NOTE  <ds>  •  doubk  quote  character  ccc 

.vidwie 

IF. 

march  bbb 

%4et  er_*1e  •  ^nrslrl  ccc 

blue*** 

17 

1R 

19 

i  - sae 

JUNE*  - aaa 

’C_T _ Q22_’*3ijMCATED_OCY_cowYTia_someOOs_FF_uc_k_LF_uc_k_Eirpl>Fl<>s_Spece4_E}nb'rXQs_Ent£)LMs  cke!>.  ccc 

.. — ♦  c<x 

blue 

20 

21 

AUGUST  .  aaa 

mAmmmmmmmmmmmmmmmmmmmmMmmmmmmmmmimAmmmmmm—mmmAmmmmm—mm  _  GOC 

— 1  . . - . - . . .  CCC 

RED 

*hte 

The  following  is  the  original  broken  delimited  text  file  having  been  fixed  using  the  truncated-only  method: 


TWUNC^TTO.OHlV.krr-*CO.RESULTS  -  T*:»ep»d  ' 

f<»  1s*l  format  S'i*«v 


'-aaa, 

-aai," 

- -aaa," 

I " , "" .march - AAA , 

. aaa,“ 

I"  .  My,  "  ,  - AAA  , 

I  JJJ. 

■  V  , - Add, 

I  aaa. 

hiptmlmi  , - ddd, 

I  OCKIBLM  ",  aaa." 

I . HOVFNBFR , - aaa, 

I  DiciwBlR  ,  -  aaa,** 

I ------- --aaa. 

. , , r lbruary" ,  aaa." 

I  Mrtn, - aaa," 

I .  aprll,,",- . aaa, 

I  . - - AAA  , - 

I  sunk  '  *  ,**,——— - aaa,*”' 

I , - aaa, — 

I  "aucust  ", - aaa,"- 


Flrst  FleTiTcontents:  ^jc Tc soee_txq_fie1ds 
Last  Fluid  concents:  uc  1c  saee_txq_f1  "Ids 
Met  corca1ns_text_qua11fyirq  -  *atr(y): 

Met  fld.dle  -  Snrstr (\x2c); 

Met  rue. fields  -  NstrfS): 

Met  txq  •  *str(Mdq>). 

Met  run  appended  method  -  %str(y>: 

Met  i  un_Cr uncat ed_nnl y_eet hod  -  t»tr{y); 


eept 1 es 
eepc 1 es 


spaces 

spaces 


embedded. t xqs 
eebedded.txqs 


e»bedded_dl»s 

eibedded.dles 


BLUE 


First  at t wept: 
Second  attcrpt: 


Met  flrst_fle1d  ■ 
Met  flrsc.fleld  < 


Burst 

Nnrsti 


Wii; 


*  J  l->); 

aruary  Ifebruary  | . .  I  noveebcr  I  decerber  |  *»; 


.ccc, 

”.ccc," 

".ccc,"*' 

.ccc,", 

“.ccc," 

,CCC,"  WRITE , 
CCC," 

, CCC, "."".’"'RED 
“.CCC, 

",CCC.  BLUE' 

",ccc, . .red, . 

.ccc. 

",ccc,"  '  blue',, 

/".red. . 

",CCC,"  , white 

", CCC, "BLUE""  ,  * 


Met  last. field  -  Nnrstr«?1  :r«d|wh1te|blue|  *)); 

•  CCC. 

noil:  «dg»  -  double  quote  character 
Met  lr_flle  -  Bnrstrf 

Tc_r_022.tRUNCATKD.ONLY_coeea.soeeDqs.rr  .uc.Ic.Lr.uc.Ic.tepcyt  1ds.spaces.KvbTxqs.KebDLNS.exe); 
- ,ccc,"' , ’white" 

— - . , . . . . . , - — - ..... — ... - - - - ,  ccc,  "blue 

• . . . . . .  CCC , "  RED 

. - . ", ccc, white 
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The  following  is  the  SAS  dataset  created  by  importing  the  above  fixed  delimited  text  file  into  SAS: 


1 

? 

n 

&  F2 

&  « 

F« 

r*  FS 

•DECEMBER' 

-•« 

Firgt  F*Wd  corMKa  uc  k  8  rwcw  •'rfcedded.b'qt  orttdM.dmi 

ccc 

“bue 

JANUARY  ’ 

- 555 

Lwl  F^io  c-crterts  uc  k  *poc«  yr-bcdded.trre  efrfceoded.alr'ns 

ccc 

3 

- Mi 

*.l«  cortain»_iext_aual^/«no  -  *»sfrL/; 

ccc 

4 

*  MARCH  * 

••Mi 

*  .Its  -  \iw*tn'otXl 

ccc 

BLUE" 

5 

- Mi 

Met itur  •  \*»«'5l , 

ccc 

6 

7 

t 

- Mi 

*.M  beg  • 

‘.l«r-r_apee«)ec_*^ero<l 
.ler  r.r_tr-ncaled_on»/_met>od  -  i 

ccc 

viwnE  . 

•jut* 

saa 

:cc 

•  -RED 

t 

- as 

ccc 

10 

11 

MC4 embei 

OCTOB01 

- Mi 

Firal  «****«  '  Met  4r«!_i«fcS  •  *  .roelit  |A-Z  f|) 

Second  •kte'^tx  MM  W«_Wd»  %*Titr  (7»i«ru*ry4ebrui»Yl  lnc.,«nib*eidec«*nber  “)>. 

ccc 

ccc 

BIUE 

i? 

■NOVEMBER* 

- 5M 

ccc 

13 

DECEMBER 

- au 

‘.le!  Iwt.iefc  •  '.nrcirli red  Atteeitfu*.  *t| 

coc 

IS 

■  FEBRUARY 

- 4M 

NOTE  Aj  '  •  double  quote  cherectar 

ccc 

«t*te 

16 

mtrcF 

- BM 

Met  in_f»k  ■  %nretn 

ccc 

BLUr  * 

17 

apr. 

. saa 

T  C_T _ Q22_TRiJMCATED_ONLY_cc*Tvra_*£nrieOO»_FF_uc_lc_Lf  _uc_k_Empt>,F  Jd5_Scace5_EmbT)CQ»_&r6C4.Ms  ttl, 

coc 

-  '  m - —  ~ - ~im  "Vi" . - - — - 

70 

ccc 

RED 

71 

AUGUST 

- we 

- - -  CCC 

wbie _ 

EXAMPLE  #2 

The  following  broken  delimited  text  file  contains  both  truncated  records  and  appended  records: 
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The  following  is  the  above  broken  delimited  text  file  having  been  fixed  using  the  appended  method. 
Notice  the  few  double  quotes  that  get  matched  to  one  record’s  last  field  instead  of  the  following  record’s 
first  field.  This  is  unavoidable  since  these  double  quotes  could  logically  belong  to  either  field. 


IPQIHD  MfTMOD  M 

Stim-r* 

N«  tu  hrr nt  H* 

'jmiumy  ■**■* 

•••Ml, 

-  IM, 

*,  lijUK. 

Ml. 

-  Ml, 

• 

*  .My!  “  ,  - 

|t" 

:::MI, 

1557?*??. -!!* . 

Ml, 

-;-«M. 

■arch, . 

—  Ml, 

I  MS  -  - 

B*XM  ■ - »u . 

J - - - »M, 

I'AUM**  '  . . »M, 


•  »rjt  »»e1d  contents;  ue  lc  30M_c*q_f 'elds  ewp*1*»  jnc«  «*fcedded_t)iqs  embedded.  (Pm 

list  »ie1d  comeots:  uc  U  *eld»  ewq*1«s  jpscej  ewhedded.r  «qj  e«t>edded-di«a 

Met  co»x«in».te»x-jqo»ltfylr*9  -  H*trCy>; 

Met  f1d_dl*  -  Vr5tr(  «2C): 

Met  n*W_M«1d5  -  bJtr(S): 

Met  tMJ  •  *3tr :\.do*); 

Me*  run_appended_*ethod  -  *5tr{y); 

Met  ryn_trunc«ted-on1y_»e*hod  -  V5tr-(n); 

Mrst  attempt  Me*  t1rs*_tieid  -  Vr»tr<C?1 : U-Jt  J*)); 

Second  ttteopt:  Vice  f1rsc_fle1d  •  VrstrfC’l  ja»*ary|febrvary| , ,  no«,**fc«r Idecepber  |  •)); 

Met  last_M«1d  -  irr*tr((?i:redl»#»uelb1uei  •»; 

wo't !  *dq*  -  double  quote  character 
Met  1*\_fHe  -  Knr jtr< 

»C_»*_W2_>ret**«D_4*«D_!»’.*<*rio_cc»«*_3<o»eOQ3_^r_uc_tc_Lr_uc_tc-i»rtyf  ids-jpaces-irtnos-Catco* 


t«t); 


.ccc,  - b 

".ccc." 

".ccc. —  .* 

.CCC,".  itut.  — *' 

".ccc." 

.CCC."  l*«Tt. 

•CCC,  .  »  *4D 

".CCC. 

".ccc.  ’•cue 
"  .ccc.  ,red.“' 
.ccc. 

'.ccc. '  I1H  . 

.ccc.".*'.*-ed." . 

.ccc,*  .white 
".ccc. "•cot""  . 
.ccc. 

— .ccc.*  , 'white* 

— - .ccc.  “blue  * 

- .CCC,-  •to" 

- ‘.ccc.ihitr 


UlCdl 


/ 


The  following  is  the  SAS  dataset  created  by  importing  the  above  fixed  delimited  text  file  into  SAS: 


EXAMPLE  #3 

The  following  delimited  text  file  is  in  an  “already-fixed”  state  since  it  contains  no  truncated  or  appended 
records: 
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Here  are  the  results  using  the  appended  method  that  show  that  the  “already-fixed”  state  has  been 
maintained  except  for  a  few  double  quotes  that  get  matched  to  one  record’s  last  field  instead  of  the 
following  record’s  first  field.  This  is  unavoidable  since  these  double  quotes  could  logically  belong  to  either 
field. 


The  following  is  the  SAS  dataset  created  by  importing  the  above  fixed  delimited  text  file  into  SAS: 


&  n 

t&  F2 

&  FJ 

&  F4 

A  rs 

i 

*DECEWE£R' 

First  Field  ccraeres  uc  k  some_t*qJ«tos  empoes  spaces  trteMM  aqs  errfcedded_dms 

ccc 

“blue, 

2 

IBM  Field  contents  uc  Ic  scm*_txq_fielas  emetics  spaces  errbeddedjao*  errbeddedjJlme 

ccc 

3 

- MB 

’.let  certains  Jextjjualifying  » 

ccc 

4 

5 

'  MARCH  * 

— aaa 

*.letfld_dlrr  • ’.nrsarl/jc2Ci; 

ccc 

BLUE- 

6 

may. 

— aaa 

Met  txa  *  *.str!Vdq- 

ccc 

tvWTE.  , 

7 

- Mi 

’.let  rur_  appended  _  method  •  ’.satyr 

ccc 

8 

•JULY" 

- Ml 

runjiv*e*»4.ortirjin**od  -  %»tr(y). 

CCC 

••RED 

3 

ccc 

10 

September 

- BM 

First  attempt  *.l«t  irstjield  •  ’.nrsarfl"*  JA-2  ft' 

ccc 

BLUE 

11 

OCTOBER  sm 

S«<x*J  An»m{X  -  %ntw(A  1). 

ccc 

\r#d.* 

12 

"NOVEMBER’ 

- aaa 

ccc 

13 

DECEMBER 

—  -aaa 

Met  last_fiela  • ‘.nrs5rt  |?i  redUt>3ettue  M 

ccc 

blue' . 

14 

- -Mi 

ccc 

\rad  “ 

13 

•  FEBRUARY 

- aaa 

NOTE  edq  s  a  double  oucte  character 

ccc 

white 

IS 

march 

Mi 

Met  in  Jile  -  ‘.nrstr 

ccc 

BUTE”  .  • 

17 

- Ml 

T  Ca»e_022_cornr»e_»orrieOQs_FF_ucJc_LF_uc_lc_E»t^T:lde_Spacea_Ert)T  XQs_Em60LM»  txt) 

ccc 

ccc 

19 

?D 

. see 

ccc 

blue 

RED 

AUGUST 

ccc 

while 

aaa 

HELPFUL  REMINDERS  AND  OTHER  IMPORTANT  INFORMATION 

•  The  file_fixing_tool  will  not  work  on  space-delimited  text  files  and  has  only  been  tested  on  files 

containing  the  six  already  mentioned  delimiters  (i.e. ,  comma,  caret,  colon,  semicolon,  pipe,  and  tab). 
A  workaround  for  space-delimited  files  would  be  to  use  a  command  or  short  script  to  change  every 
occurrence  of  the  space  character  to  one  of  the  above  delimiters  not  already  existing  in  your  broken 
file.  If  the  delimiter  already  exists,  then  your  workaround  would  need  to  employ  a  bit  of  indirection  in 
terms  of  first  changing  every  occurrence  of  your  chosen  delimiter  to  a  character  not  existing  (i.e.,  a 
non-existing  character)  in  your  file  and  then  changing  every  occurrence  of  a  space  character  to  your 
chosen  delimiter.  Then  run  the  file_fixing_tool  and  fix  your  file.  Now  take  the  fixed  text  file  and 
change  every  occurrence  of  your  chosen  delimiter  back  to  a  space  character  and  change  every 
occurrence  of  your  non-existing  character  back  to  your  chosen  delimiter.  You  now  have  a  fixed 
space-delimited  text  file. 
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•  While  still  on  the  topic  of  space-delimited  text  files,  it  is  worth  noting  that  delimited  text  files  that  can 
be  imported  using  List  Input  (Base  SAS)  or  Modified  List  Input  (Base  SAS),  where  two  consecutive 
spaces  can  mean  the  end  of  a  field,  are  not  the  same  as  space-delimited  text  files  and  need  to  first  be 
turned  into  space-delimited  text  files  (and  then  use  the  steps  in  the  above  bullet). 

•  If  you  encounter  the  “missing  semicolon”  error,  then  restart  the  file  fixing  tool. 

•  Always  use  the  hex  value  (i.e.,  \x2C)  for  the  comma  character  within  the  following  macro  variable 
values  in  the  everything_macro  program: 

1.  flddlm 

2.  first_field 

3.  Iast_field 

Failure  to  do  so  will  result  in  a  “More  positional  parameters  found  than  defined”  error. 

•  Depending  on  the  length  of  your  fi rst  field  and  last  field  patterns  within  these  macro  variables,  you 
might  encounter  the  following  warning: 

“WARNING:  Your  string  has  more  than  262  characters.  You  might  have  unbalanced  quotation 
marks.” 

Please  ignore  this  warning  if  your  patterns  are  longer  than  262  characters. 

•  Always  remove  the  header  record  from  the  top  of  your  broken  delimited  text  file  before  running  the 
file  fixing  tool.  After  running  the  file  fixing  tool,  reattach  the  header  record  to  the  top  of  your  fixed 
delimited  text  file. 

•  Any  use  of  parentheses  within  your  first  field  and  last  field  patterns  must  be  non-grouping 
parentheses  such  as  (?:...).  Otherwise,  the  capture  buffer  numbering  part  of  the  file  fixing  tool’s 
processing  will  be  destroyed  along  with  your  results. 

•  Any  use  of  the  “or”  operator  within  your  fi  rstfield  or  lastfield  patterns  must  be  enclosed  in  non¬ 
grouping  parentheses.  For  example: 

(?:A|B|C) 

•  Sometimes  broken  delimited  text  files  contain  categorical  data  values  in  the  last  field  where  these 
data  values  begin  with  characters  that  other  shortened  data  values  begin  with.  For  example,  the  field 
might  contain  the  following  values:  C,  CM,  CAT,  D,  DA,  DOG. 

The  pattern  containing  multiple  “or”  operators  would  need  to  contain  the  following  order: 

(?: CM | CAT | C | DOG | DA |D) 

If  the  'C'  in  the  above  pattern  occurs  before  the  'CM'  or  'CAT' , 
then  only  the  'C'  in  'CAT'  will  get  matched  for  the  last  field  and 
the  'AT'  could  get  matched  to  the  following  record's  first  field 
depending  on  your  first  field  pattern. 


CONCLUSIONS 

For  the  f i I e_f ix i n g  too I  to  work  correctly,  your  broken,  delimited  text  file  must  adhere  to  the  9  rules 
contained  in  this  paper’s  section  “DEFINING  A  DELIMITED  TEXT  FILE.” 

Knowing  which  method  (appended  or  truncated-only)  to  use  is  important. 

If  you  are  able  to  develop  last  field  and  first  field  patterns  that  identify  and  isolate  these  fields,  then  use 
the  appended  method.  If  not,  then  you  can  still  use  the  truncated-only  method  as  long  as  your  broken  file 
contains  only  truncated  records. 
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